A tourism company named "Visit with Us" currently offers five types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. Only 18% of the customers contacted last year purchased a package, and potential buyers were hard to identify because customers were contacted at random, without using the available information. The company is now planning to launch a new product, the Wellness Tourism Package. Wellness tourism is travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle and support or increase their sense of well-being. This time, the company wants to harness the available data on existing and potential customers to target the right customers.
As a data scientist at "Visit with Us", I will analyze the customer data, provide recommendations to the policy maker, and build a model that predicts which customers are likely to purchase the newly introduced travel package. The model will make its predictions before a customer is contacted.
The goal is to analyze, visualize, and preprocess the data, and to determine which of several ensemble models (bagging and boosting, along with their tuned versions) best predicts which customers are most likely to purchase the newly introduced travel package.
Each record in the database represents a customer's information. A detailed data dictionary can be found below.
Data Dictionary
Customer details:
Customer Interaction Data:
# import relevant libraries
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Libraries to split data, impute missing values
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
# Library to tune models and get different metric scores
from sklearn import metrics
from sklearn.metrics import (
confusion_matrix,
classification_report,
accuracy_score,
precision_score,
recall_score,
f1_score,
roc_auc_score,
)
from sklearn.model_selection import GridSearchCV
The nb_black extension is already loaded. To reload it, use: %reload_ext nb_black
# load the data
data = pd.read_excel("Tourism.xlsx", "Tourism")
# check a sample of the data to make sure it came in correctly
data.sample(n=10, random_state=101)
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 803 | 200803 | 0 | 34.0 | Company Invited | 1 | 9.0 | Salaried | Male | 2 | 4.0 | Basic | 3.0 | Divorced | 4.0 | 0 | 2 | 1 | 0.0 | Executive | 17979.0 |
| 589 | 200589 | 1 | 29.0 | Self Enquiry | 1 | 6.0 | Salaried | Female | 2 | 4.0 | Basic | 5.0 | Divorced | 2.0 | 1 | 2 | 0 | 0.0 | Executive | 17319.0 |
| 3736 | 203736 | 0 | 40.0 | Company Invited | 3 | 27.0 | Salaried | Male | 3 | 4.0 | Deluxe | 3.0 | Married | 4.0 | 0 | 3 | 1 | 1.0 | Manager | 22805.0 |
| 3996 | 203996 | 0 | 56.0 | Self Enquiry | 3 | 7.0 | Salaried | Male | 4 | 4.0 | Standard | 3.0 | Married | 5.0 | 0 | 1 | 0 | 3.0 | Senior Manager | 28917.0 |
| 3491 | 203491 | 0 | 34.0 | Company Invited | 3 | 14.0 | Small Business | Male | 3 | 4.0 | Deluxe | 5.0 | Divorced | 2.0 | 0 | 3 | 0 | 1.0 | Manager | 23051.0 |
| 1563 | 201563 | 0 | 46.0 | Company Invited | 1 | 6.0 | Small Business | Male | 2 | 4.0 | Standard | 5.0 | Married | 3.0 | 1 | 1 | 1 | 1.0 | Senior Manager | 25673.0 |
| 2503 | 202503 | 0 | 38.0 | Self Enquiry | 1 | 7.0 | Salaried | Male | 3 | 5.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 5 | 1 | 2.0 | Manager | 24671.0 |
| 160 | 200160 | 0 | 22.0 | Self Enquiry | 1 | 25.0 | Small Business | Male | 3 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 0 | 2 | 0 | 1.0 | Executive | 17323.0 |
| 3114 | 203114 | 0 | 28.0 | Self Enquiry | 1 | 11.0 | Salaried | Female | 4 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 2 | 1 | 2.0 | Executive | 20996.0 |
| 1619 | 201619 | 0 | 19.0 | Self Enquiry | 1 | 9.0 | Small Business | Female | 3 | 3.0 | Basic | 4.0 | Single | 2.0 | 0 | 3 | 1 | 0.0 | Executive | 16483.0 |
# check the shape
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns in the data.")
There are 4888 rows and 20 columns in the data.
# check that the ID column is unique
data.CustomerID.nunique()
4888
# copy the data and drop the ID column, so that the unique ID does not mask duplicate records
df = data.copy()
df = df.drop("CustomerID", axis=1)
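With the unique ID dropped, an explicit row-level duplicate check (not shown in the original cell) could look like this sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row
toy = pd.DataFrame({"Age": [34, 29, 34], "Gender": ["Male", "Female", "Male"]})

n_dupes = toy.duplicated().sum()  # rows identical to an earlier row
print(n_dupes)  # 1

toy = toy.drop_duplicates()  # keep only the first occurrence of each row
print(len(toy))  # 2
```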
# check datatypes of the columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ProdTaken 4888 non-null int64 1 Age 4662 non-null float64 2 TypeofContact 4863 non-null object 3 CityTier 4888 non-null int64 4 DurationOfPitch 4637 non-null float64 5 Occupation 4888 non-null object 6 Gender 4888 non-null object 7 NumberOfPersonVisiting 4888 non-null int64 8 NumberOfFollowups 4843 non-null float64 9 ProductPitched 4888 non-null object 10 PreferredPropertyStar 4862 non-null float64 11 MaritalStatus 4888 non-null object 12 NumberOfTrips 4748 non-null float64 13 Passport 4888 non-null int64 14 PitchSatisfactionScore 4888 non-null int64 15 OwnCar 4888 non-null int64 16 NumberOfChildrenVisiting 4822 non-null float64 17 Designation 4888 non-null object 18 MonthlyIncome 4655 non-null float64 dtypes: float64(7), int64(6), object(6) memory usage: 725.7+ KB
# check which columns have null values
df.isna().sum()[df.isna().sum() > 0]
Age 226 TypeofContact 25 DurationOfPitch 251 NumberOfFollowups 45 PreferredPropertyStar 26 NumberOfTrips 140 NumberOfChildrenVisiting 66 MonthlyIncome 233 dtype: int64
# check the unique values for the categorical variables
# treat CityTier as categorical, although it is an ordinal variable, to see its importance in the model
cat_cols = list(df.select_dtypes(include="object").columns) + [
    "ProdTaken",
    "CityTier",
    "Passport",
    "OwnCar",
]
for i in cat_cols:
    print(df[i].value_counts(normalize=True))
    print("-" * 50)
Self Enquiry 0.708205 Company Invited 0.291795 Name: TypeofContact, dtype: float64 -------------------------------------------------- Salaried 0.484452 Small Business 0.426350 Large Business 0.088789 Free Lancer 0.000409 Name: Occupation, dtype: float64 -------------------------------------------------- Male 0.596563 Female 0.371727 Fe Male 0.031710 Name: Gender, dtype: float64 -------------------------------------------------- Basic 0.376841 Deluxe 0.354337 Standard 0.151800 Super Deluxe 0.069967 King 0.047054 Name: ProductPitched, dtype: float64 -------------------------------------------------- Married 0.478723 Divorced 0.194354 Single 0.187398 Unmarried 0.139525 Name: MaritalStatus, dtype: float64 -------------------------------------------------- Executive 0.376841 Manager 0.354337 Senior Manager 0.151800 AVP 0.069967 VP 0.047054 Name: Designation, dtype: float64 -------------------------------------------------- 0 0.811784 1 0.188216 Name: ProdTaken, dtype: float64 -------------------------------------------------- 1 0.652619 3 0.306874 2 0.040507 Name: CityTier, dtype: float64 -------------------------------------------------- 0 0.709083 1 0.290917 Name: Passport, dtype: float64 -------------------------------------------------- 1 0.620295 0 0.379705 Name: OwnCar, dtype: float64 --------------------------------------------------
# replace "Fe Male" with "Female"
df.Gender.replace("Fe Male", "Female", inplace=True)
df.Gender.value_counts()
Male 2916 Female 1972 Name: Gender, dtype: int64
# look at the statistical summary of the data
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4888.0 | NaN | NaN | NaN | 0.188216 | 0.390925 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4662.0 | NaN | NaN | NaN | 37.622265 | 9.316387 | 18.0 | 31.0 | 36.0 | 44.0 | 61.0 |
| TypeofContact | 4863 | 2 | Self Enquiry | 3444 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CityTier | 4888.0 | NaN | NaN | NaN | 1.654255 | 0.916583 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4637.0 | NaN | NaN | NaN | 15.490835 | 8.519643 | 5.0 | 9.0 | 13.0 | 20.0 | 127.0 |
| Occupation | 4888 | 4 | Salaried | 2368 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 4888 | 2 | Male | 2916 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfPersonVisiting | 4888.0 | NaN | NaN | NaN | 2.905074 | 0.724891 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4843.0 | NaN | NaN | NaN | 3.708445 | 1.002509 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| ProductPitched | 4888 | 5 | Basic | 1842 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PreferredPropertyStar | 4862.0 | NaN | NaN | NaN | 3.581037 | 0.798009 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| MaritalStatus | 4888 | 4 | Married | 2340 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NumberOfTrips | 4748.0 | NaN | NaN | NaN | 3.236521 | 1.849019 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4888.0 | NaN | NaN | NaN | 0.290917 | 0.454232 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 4888.0 | NaN | NaN | NaN | 3.078151 | 1.365792 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 4888.0 | NaN | NaN | NaN | 0.620295 | 0.485363 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | NaN | NaN | NaN | 1.187267 | 0.857861 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Designation | 4888 | 5 | Executive | 1842 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MonthlyIncome | 4655.0 | NaN | NaN | NaN | 23619.853491 | 5380.698361 | 1000.0 | 20346.0 | 22347.0 | 25571.0 | 98678.0 |
Significant observations:
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default True)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:  # for the histogram
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add the mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add the median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
## plot histogram and boxplot for the numerical features
num_cols = [
    "Age",
    "DurationOfPitch",
    "MonthlyIncome",
    "NumberOfPersonVisiting",
    "PitchSatisfactionScore",
    "NumberOfFollowups",
    "PreferredPropertyStar",
    "NumberOfTrips",
    "NumberOfChildrenVisiting",
]
for i in num_cols:
    print(i)
    histogram_boxplot(df, i)
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
Age
****************************************************************** DurationOfPitch
****************************************************************** MonthlyIncome
****************************************************************** NumberOfPersonVisiting
****************************************************************** PitchSatisfactionScore
****************************************************************** NumberOfFollowups
****************************************************************** PreferredPropertyStar
****************************************************************** NumberOfTrips
****************************************************************** NumberOfChildrenVisiting
******************************************************************
## Barplot for the categorical features
for i in cat_cols:
    print(i)
    labeled_barplot(df, i, perc=True)
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
TypeofContact
****************************************************************** Occupation
****************************************************************** Gender
****************************************************************** ProductPitched
****************************************************************** MaritalStatus
****************************************************************** Designation
****************************************************************** ProdTaken
****************************************************************** CityTier
****************************************************************** Passport
****************************************************************** OwnCar
******************************************************************
DurationOfPitch, MonthlyIncome, and NumberOfTrips show some extreme outliers in their univariate distributions.
# look at leftmost outliers for MonthlyIncome
df1 = df.copy()
df1.sort_values(by="MonthlyIncome").head()
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | 0 | 38.0 | Self Enquiry | 1 | 9.0 | Large Business | Female | 2 | 3.0 | Deluxe | 3.0 | Single | 4.0 | 1 | 5 | 0 | 0.0 | Manager | 1000.0 |
| 2586 | 0 | 39.0 | Self Enquiry | 1 | 10.0 | Large Business | Female | 3 | 4.0 | Deluxe | 3.0 | Single | 5.0 | 1 | 5 | 0 | 1.0 | Manager | 4678.0 |
| 513 | 1 | 20.0 | Self Enquiry | 1 | 16.0 | Small Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 2.0 | 1 | 5 | 0 | 0.0 | Executive | 16009.0 |
| 1983 | 1 | 20.0 | Self Enquiry | 1 | 16.0 | Small Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 16009.0 |
| 2197 | 0 | 18.0 | Company Invited | 1 | 11.0 | Salaried | Male | 3 | 3.0 | Basic | 3.0 | Single | 2.0 | 0 | 1 | 0 | 1.0 | Executive | 16051.0 |
# look at rightmost outliers for MonthlyIncome
df1.sort_values(by="MonthlyIncome", na_position="last", ascending=False).head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2482 | 0 | 37.0 | Self Enquiry | 1 | 12.0 | Salaried | Female | 3 | 5.0 | Basic | 5.0 | Divorced | 2.0 | 1 | 2 | 1 | 1.0 | Executive | 98678.0 |
| 38 | 0 | 36.0 | Self Enquiry | 1 | 11.0 | Salaried | Female | 2 | 4.0 | Basic | NaN | Divorced | 1.0 | 1 | 2 | 1 | 0.0 | Executive | 95000.0 |
| 4104 | 0 | 53.0 | Self Enquiry | 1 | 7.0 | Salaried | Male | 4 | 5.0 | King | NaN | Married | 2.0 | 0 | 1 | 1 | 3.0 | VP | 38677.0 |
| 2634 | 0 | 53.0 | Self Enquiry | 1 | 7.0 | Salaried | Male | 4 | 5.0 | King | NaN | Divorced | 2.0 | 0 | 2 | 1 | 2.0 | VP | 38677.0 |
| 4660 | 0 | 42.0 | Company Invited | 1 | 14.0 | Salaried | Female | 3 | 6.0 | King | NaN | Married | 3.0 | 0 | 4 | 1 | 2.0 | VP | 38651.0 |
# see relationship between Designation and MonthlyIncome
sns.boxplot(data=df1, x="Designation", y="MonthlyIncome")
plt.show()
# impute outliers with the median
df1.loc[df1.MonthlyIncome < 15000, "MonthlyIncome"] = df1[df1.Designation == "Manager"][
"MonthlyIncome"
].median()
df1.loc[df1.MonthlyIncome > 40000, "MonthlyIncome"] = df1[
df1.Designation == "Executive"
]["MonthlyIncome"].median()
# find the values for DurationOfPitch that are more than 4*IQR from the median
quartiles = np.quantile(
df1["DurationOfPitch"][df1["DurationOfPitch"].notnull()], [0.25, 0.75]
)
dop_4iqr = 4 * (quartiles[1] - quartiles[0])
outlier_dop = df1.loc[
np.abs(df1["DurationOfPitch"] - df1["DurationOfPitch"].median()) > dop_4iqr,
"DurationOfPitch",
]
print(outlier_dop.sort_values(ascending=False).count() / df1.shape[0] * 100, "%")
df1.loc[outlier_dop.sort_values().index]
0.04091653027823241 %
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1434 | 0 | NaN | Company Invited | 3 | 126.0 | Salaried | Male | 2 | 3.0 | Basic | 3.0 | Married | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 18482.0 |
| 3878 | 0 | 53.0 | Company Invited | 3 | 127.0 | Salaried | Male | 3 | 4.0 | Basic | 3.0 | Married | 4.0 | 0 | 1 | 1 | 2.0 | Executive | 22160.0 |
# drop outliers in DurationOfPitch
df1 = df1[~df1.index.isin(list(outlier_dop.index))]
# find the values for NumberOfTrips that are greater than 4*IQR from the median
quartiles = np.quantile(
df1["NumberOfTrips"][df1["NumberOfTrips"].notnull()], [0.25, 0.75]
)
not_4iqr = 4 * (quartiles[1] - quartiles[0])
outlier_not = df1.loc[
np.abs(df1["NumberOfTrips"] - df1["NumberOfTrips"].median()) > not_4iqr,
"NumberOfTrips",
]
print(outlier_not.sort_values(ascending=False).count() / df1.shape[0] * 100, "%")
df1.loc[outlier_not.sort_values().index]
0.08186655751125665 %
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 385 | 1 | 30.0 | Company Invited | 1 | 10.0 | Large Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 19.0 | 1 | 4 | 1 | 1.0 | Executive | 17285.0 |
| 2829 | 1 | 31.0 | Company Invited | 1 | 11.0 | Large Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 20.0 | 1 | 4 | 1 | 2.0 | Executive | 20963.0 |
| 816 | 0 | 39.0 | Company Invited | 1 | 15.0 | Salaried | Male | 3 | 3.0 | Deluxe | 4.0 | Unmarried | 21.0 | 0 | 2 | 1 | 0.0 | Manager | 21782.0 |
| 3260 | 0 | 40.0 | Company Invited | 1 | 16.0 | Salaried | Male | 4 | 4.0 | Deluxe | 4.0 | Unmarried | 22.0 | 0 | 2 | 1 | 1.0 | Manager | 25460.0 |
# drop outliers in NumberOfTrips
df1 = df1[~df1.index.isin(list(outlier_not.index))]
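The same rule — flag values more than 4*IQR from the median — is applied twice above; it can be factored into a small helper (a sketch under the same definition, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def iqr_outlier_index(series, k=4):
    """Return the index of values more than k*IQR away from the median."""
    s = series.dropna()
    q1, q3 = np.quantile(s, [0.25, 0.75])
    threshold = k * (q3 - q1)
    return s[np.abs(s - s.median()) > threshold].index

# Hypothetical toy series: 22 lies far beyond 4*IQR from the median
trips = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 22])
print(list(iqr_outlier_index(trips)))  # [9]
```

The flagged index can then be dropped in one step, e.g. `df1 = df1.drop(iqr_outlier_index(df1["NumberOfTrips"]))`.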
# view distributions after removal of outliers
cols = [
"DurationOfPitch",
"MonthlyIncome",
"NumberOfTrips",
]
for i in cols:
    print(i)
    histogram_boxplot(df1, i)
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
DurationOfPitch
****************************************************************** MonthlyIncome
****************************************************************** NumberOfTrips
******************************************************************
With the extreme values removed, we will get a better picture of the relationships between the variables.
# correlation plot
plt.figure(figsize=(15, 7))
sns.heatmap(df1.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
We will see how the target variable varies depending on other features.
# plot numerical features against each other
sns.pairplot(data=df1, hue="ProdTaken", vars=num_cols)
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
# plot numerical variables with respect to target
sns.set(font_scale=1)
for i in num_cols:
    distribution_plot_wrt_target(df1, i, "ProdTaken")
    plt.show()
    print("*" * 100)
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
# examine the distributions of Age and MonthlyIncome by target
sns.swarmplot(data=df1, x="ProdTaken", y="Age")
plt.show()
sns.swarmplot(data=df1, x="ProdTaken", y="MonthlyIncome")
plt.show()
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place legend outside the plot
    plt.show()
# plot categorical variables with respect to target
othercols = cat_cols.copy()
othercols.remove("ProdTaken")
for i in othercols:
    print(i)
    stacked_barplot(df1, i, "ProdTaken")
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
TypeofContact ProdTaken 0 1 All TypeofContact All 3942 915 4857 Self Enquiry 2837 607 3444 Company Invited 1105 308 1413 ------------------------------------------------------------------------------------------------------------------------
****************************************************************** Occupation ProdTaken 0 1 All Occupation All 3964 918 4882 Salaried 1950 414 2364 Small Business 1700 384 2084 Large Business 314 118 432 Free Lancer 0 2 2 ------------------------------------------------------------------------------------------------------------------------
****************************************************************** Gender ProdTaken 0 1 All Gender All 3964 918 4882 Male 2334 576 2910 Female 1630 342 1972 ------------------------------------------------------------------------------------------------------------------------
****************************************************************** ProductPitched ProdTaken 0 1 All ProductPitched All 3964 918 4882 Basic 1288 550 1838 Deluxe 1526 204 1730 Standard 618 124 742 King 210 20 230 Super Deluxe 322 20 342 ------------------------------------------------------------------------------------------------------------------------
****************************************************************** MaritalStatus ProdTaken 0 1 All MaritalStatus All 3964 918 4882 Married 2012 326 2338 Single 612 302 914 Unmarried 514 166 680 Divorced 826 124 950 ------------------------------------------------------------------------------------------------------------------------
****************************************************************** Designation ProdTaken 0 1 All Designation All 3964 918 4882 Executive 1288 550 1838 Manager 1526 204 1730 Senior Manager 618 124 742 AVP 322 20 342 VP 210 20 230 ------------------------------------------------------------------------------------------------------------------------
****************************************************************** CityTier ProdTaken 0 1 All CityTier All 3964 918 4882 1 2668 518 3186 3 1144 354 1498 2 152 46 198 ------------------------------------------------------------------------------------------------------------------------
****************************************************************** Passport ProdTaken 0 1 All Passport All 3964 918 4882 1 928 492 1420 0 3036 426 3462 ------------------------------------------------------------------------------------------------------------------------
****************************************************************** OwnCar ProdTaken 0 1 All OwnCar All 3964 918 4882 1 2468 558 3026 0 1496 360 1856 ------------------------------------------------------------------------------------------------------------------------
******************************************************************
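The crosstabs above show, for instance, that passport holders convert at a much higher rate; one way (not in the original notebook) to quantify whether such a split is statistically significant is a chi-square test of independence, sketched here on the Passport counts from the table above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Passport vs ProdTaken counts, taken from the crosstab above
tab = pd.DataFrame(
    {0: [3036, 928], 1: [426, 492]}, index=["No passport", "Passport"]
)

chi2, p_value, dof, expected = chi2_contingency(tab)
print(dof, p_value < 0.05)  # 1 True -- the association is significant
```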
# see how TypeofContact varies with other variables as it did not have a significant difference with respect to the target
plt.figure(figsize=(20, 20))
for i, variable in enumerate(num_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(
        x=df1["TypeofContact"], y=df1[variable], hue=df1["ProdTaken"], palette="PuBu"
    )
    plt.tight_layout()
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.title(variable)
plt.show()
# see how Gender varies with other variables as it did not have a significant difference with respect to the target
plt.figure(figsize=(20, 20))
for i, variable in enumerate(num_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(x=df1["Gender"], y=df1[variable], hue=df1["ProdTaken"], palette="PuBu")
    plt.tight_layout()
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.title(variable)
plt.show()
# see how OwnCar varies with other variables as those variables did not have a significant difference with respect to the target
plt.figure(figsize=(20, 20))
for i, variable in enumerate(num_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(x=df1["OwnCar"], y=df1[variable], hue=df1["ProdTaken"], palette="PuBu")
    plt.tight_layout()
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.title(variable)
plt.show()
Create a profile of the customers (for example, demographic information) who purchased a package. A profile has to be created for each of the 5 packages.
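Besides the per-package describe() tables produced below, a compact one-table profile can be pulled with a single groupby-agg; this is a sketch on a hypothetical toy subset that mirrors a few of the data-dictionary columns:

```python
import pandas as pd

# Hypothetical toy subset of purchasers with a few profile columns
df_buyers = pd.DataFrame(
    {
        "ProductPitched": ["Basic", "Basic", "Standard", "Deluxe", "Deluxe"],
        "Age": [25, 35, 41, 37, 39],
        "MonthlyIncome": [17000, 21000, 26000, 23000, 24000],
        "Passport": [1, 0, 0, 1, 0],
    }
)

# one summary row per package, with named aggregations
profile = df_buyers.groupby("ProductPitched").agg(
    avg_age=("Age", "mean"),
    median_income=("MonthlyIncome", "median"),
    passport_rate=("Passport", "mean"),
)
print(profile)
```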
# filter for customers who purchased a package
df_cust = df1[df1.ProdTaken == 1]
order = ["Basic", "Standard", "Deluxe", "Super Deluxe", "King"]
# distribution of packages bought
sns.set(font_scale=1)
sns.countplot(data=df_cust, x="ProductPitched", order=order)
<AxesSubplot:xlabel='ProductPitched', ylabel='count'>
# plot package type vs numerical variables
plt.figure(figsize=(25, 30))
sns.set(font_scale=2)
for i, name in enumerate(num_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(data=df_cust, y=name, x="ProductPitched", order=order)
    plt.tight_layout()
    plt.title(name)
# plot statistics by Product Pitched
for i in order:
    print("Statistics for ", i)
    df_sub = df_cust[df_cust.ProductPitched == i]
    display(df_sub.describe().T)
    print("*" * 50)
Statistics for Basic
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 550.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Age | 513.0 | 31.292398 | 9.088340 | 18.0 | 25.0 | 30.0 | 35.0 | 59.0 |
| CityTier | 550.0 | 1.512727 | 0.833509 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| DurationOfPitch | 530.0 | 15.811321 | 7.915090 | 6.0 | 9.0 | 14.0 | 22.0 | 36.0 |
| NumberOfPersonVisiting | 550.0 | 2.907273 | 0.701639 | 2.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| NumberOfFollowups | 546.0 | 3.952381 | 0.968079 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| PreferredPropertyStar | 550.0 | 3.774545 | 0.862119 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 |
| NumberOfTrips | 545.0 | 3.166972 | 1.836019 | 1.0 | 2.0 | 3.0 | 3.0 | 8.0 |
| Passport | 550.0 | 0.581818 | 0.493709 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 550.0 | 3.210909 | 1.354702 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 550.0 | 0.570909 | 0.495397 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 549.0 | 1.220401 | 0.867427 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| MonthlyIncome | 527.0 | 20165.466793 | 3317.026073 | 16009.0 | 17546.0 | 20582.0 | 21406.5 | 37868.0 |
************************************************** Statistics for Standard
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 124.0 | 1.000000 | 0.000000 | 1.0 | 1.00 | 1.0 | 1.00 | 1.0 |
| Age | 123.0 | 41.008130 | 9.876695 | 19.0 | 33.00 | 38.0 | 49.00 | 60.0 |
| CityTier | 124.0 | 2.096774 | 0.966255 | 1.0 | 1.00 | 3.0 | 3.00 | 3.0 |
| DurationOfPitch | 123.0 | 19.065041 | 9.048811 | 6.0 | 11.00 | 17.0 | 29.00 | 36.0 |
| NumberOfPersonVisiting | 124.0 | 2.967742 | 0.709236 | 2.0 | 2.00 | 3.0 | 3.00 | 4.0 |
| NumberOfFollowups | 124.0 | 3.935484 | 0.908335 | 1.0 | 3.00 | 4.0 | 4.25 | 6.0 |
| PreferredPropertyStar | 123.0 | 3.731707 | 0.878460 | 3.0 | 3.00 | 3.0 | 5.00 | 5.0 |
| NumberOfTrips | 123.0 | 3.016260 | 1.815163 | 1.0 | 2.00 | 2.0 | 4.00 | 8.0 |
| Passport | 124.0 | 0.387097 | 0.489062 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 124.0 | 3.467742 | 1.309350 | 1.0 | 3.00 | 3.0 | 5.00 | 5.0 |
| OwnCar | 124.0 | 0.661290 | 0.475191 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 123.0 | 1.121951 | 0.901596 | 0.0 | 0.00 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 124.0 | 26035.419355 | 3593.290353 | 17372.0 | 23974.75 | 25711.0 | 28628.00 | 38395.0 |
************************************************** Statistics for Deluxe
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 204.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Age | 198.0 | 37.641414 | 8.469575 | 21.0 | 32.0 | 35.5 | 44.0 | 59.0 |
| CityTier | 204.0 | 2.411765 | 0.913532 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 |
| DurationOfPitch | 180.0 | 19.100000 | 9.227176 | 6.0 | 11.0 | 16.0 | 28.0 | 36.0 |
| NumberOfPersonVisiting | 204.0 | 2.950980 | 0.707141 | 2.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| NumberOfFollowups | 200.0 | 3.970000 | 1.051011 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| PreferredPropertyStar | 203.0 | 3.699507 | 0.857899 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 |
| NumberOfTrips | 202.0 | 3.702970 | 2.022483 | 1.0 | 2.0 | 3.0 | 5.0 | 8.0 |
| Passport | 204.0 | 0.490196 | 0.501134 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 204.0 | 3.039216 | 1.278250 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 204.0 | 0.607843 | 0.489432 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 203.0 | 1.172414 | 0.841279 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| MonthlyIncome | 195.0 | 23106.215385 | 3592.466947 | 17086.0 | 20744.0 | 23186.0 | 24506.0 | 38525.0 |
************************************************** Statistics for Super Deluxe
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 20.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.00 | 1.0 |
| Age | 20.0 | 43.500000 | 4.839530 | 39.0 | 40.0 | 42.0 | 45.25 | 56.0 |
| CityTier | 20.0 | 2.600000 | 0.820783 | 1.0 | 3.0 | 3.0 | 3.00 | 3.0 |
| DurationOfPitch | 20.0 | 18.500000 | 7.330542 | 8.0 | 15.0 | 18.5 | 20.00 | 31.0 |
| NumberOfPersonVisiting | 20.0 | 2.700000 | 0.656947 | 2.0 | 2.0 | 3.0 | 3.00 | 4.0 |
| NumberOfFollowups | 20.0 | 3.100000 | 1.618967 | 1.0 | 2.0 | 3.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 20.0 | 3.600000 | 0.820783 | 3.0 | 3.0 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 19.0 | 3.263158 | 2.490919 | 1.0 | 1.0 | 2.0 | 5.50 | 8.0 |
| Passport | 20.0 | 0.600000 | 0.502625 | 0.0 | 0.0 | 1.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 20.0 | 3.800000 | 1.005249 | 3.0 | 3.0 | 3.0 | 5.00 | 5.0 |
| OwnCar | 20.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 20.0 | 1.200000 | 0.833509 | 0.0 | 1.0 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 20.0 | 29823.800000 | 3520.426404 | 21151.0 | 28129.5 | 29802.5 | 31997.25 | 37502.0 |
************************************************** Statistics for King
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 20.0 | 1.000000 | 0.000000 | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
| Age | 20.0 | 48.900000 | 9.618513 | 27.0 | 42.00 | 52.5 | 56.0 | 59.0 |
| CityTier | 20.0 | 1.800000 | 1.005249 | 1.0 | 1.00 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 20.0 | 10.500000 | 4.135851 | 8.0 | 8.00 | 9.0 | 9.0 | 19.0 |
| NumberOfPersonVisiting | 20.0 | 2.900000 | 0.718185 | 2.0 | 2.00 | 3.0 | 3.0 | 4.0 |
| NumberOfFollowups | 20.0 | 4.300000 | 1.128576 | 3.0 | 3.00 | 4.0 | 5.0 | 6.0 |
| PreferredPropertyStar | 16.0 | 3.750000 | 0.683130 | 3.0 | 3.00 | 4.0 | 4.0 | 5.0 |
| NumberOfTrips | 17.0 | 3.411765 | 1.938389 | 1.0 | 2.00 | 3.0 | 4.0 | 7.0 |
| Passport | 20.0 | 0.600000 | 0.502625 | 0.0 | 0.00 | 1.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 20.0 | 3.300000 | 1.218282 | 1.0 | 3.00 | 3.0 | 4.0 | 5.0 |
| OwnCar | 20.0 | 0.900000 | 0.307794 | 0.0 | 1.00 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 16.0 | 1.437500 | 0.892095 | 0.0 | 1.00 | 1.0 | 2.0 | 3.0 |
| MonthlyIncome | 20.0 | 34672.100000 | 5577.603833 | 17517.0 | 34470.25 | 34859.0 | 38223.0 | 38537.0 |
**************************************************
# plot categorical variables with respect to ProductPitched
othercols = cat_cols.copy()
othercols.remove("ProductPitched")
sns.set(font_scale=1)
for i in othercols:
    print(i)
    stacked_barplot(df_cust, i, "ProductPitched")
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
TypeofContact
| TypeofContact | Basic | Deluxe | King | Standard | Super Deluxe | All |
|---|---|---|---|---|---|---|
| All | 547 | 204 | 20 | 124 | 20 | 915 |
| Company Invited | 192 | 68 | 0 | 32 | 16 | 308 |
| Self Enquiry | 355 | 136 | 20 | 92 | 4 | 607 |
******************************************************************
Occupation
| Occupation | Basic | Deluxe | King | Standard | Super Deluxe | All |
|---|---|---|---|---|---|---|
| All | 550 | 204 | 20 | 124 | 20 | 918 |
| Salaried | 260 | 80 | 4 | 54 | 16 | 414 |
| Small Business | 202 | 108 | 12 | 58 | 4 | 384 |
| Free Lancer | 2 | 0 | 0 | 0 | 0 | 2 |
| Large Business | 86 | 16 | 4 | 12 | 0 | 118 |
******************************************************************
Gender
| Gender | Basic | Deluxe | King | Standard | Super Deluxe | All |
|---|---|---|---|---|---|---|
| All | 550 | 204 | 20 | 124 | 20 | 918 |
| Male | 342 | 134 | 8 | 76 | 16 | 576 |
| Female | 208 | 70 | 12 | 48 | 4 | 342 |
******************************************************************
MaritalStatus
| MaritalStatus | Basic | Deluxe | King | Standard | Super Deluxe | All |
|---|---|---|---|---|---|---|
| All | 550 | 204 | 20 | 124 | 20 | 918 |
| Single | 228 | 45 | 8 | 11 | 10 | 302 |
| Married | 188 | 68 | 6 | 56 | 8 | 326 |
| Unmarried | 74 | 59 | 0 | 31 | 2 | 166 |
| Divorced | 60 | 32 | 6 | 26 | 0 | 124 |
******************************************************************
Designation
| Designation | Basic | Deluxe | King | Standard | Super Deluxe | All |
|---|---|---|---|---|---|---|
| AVP | 0 | 0 | 0 | 0 | 20 | 20 |
| All | 550 | 204 | 20 | 124 | 20 | 918 |
| Executive | 550 | 0 | 0 | 0 | 0 | 550 |
| Manager | 0 | 204 | 0 | 0 | 0 | 204 |
| Senior Manager | 0 | 0 | 0 | 124 | 0 | 124 |
| VP | 0 | 0 | 20 | 0 | 0 | 20 |
******************************************************************
ProdTaken
| ProdTaken | Basic | Deluxe | King | Standard | Super Deluxe | All |
|---|---|---|---|---|---|---|
| 1 | 550 | 204 | 20 | 124 | 20 | 918 |
| All | 550 | 204 | 20 | 124 | 20 | 918 |
******************************************************************
CityTier
| CityTier | Basic | Deluxe | King | Standard | Super Deluxe | All |
|---|---|---|---|---|---|---|
| All | 550 | 204 | 20 | 124 | 20 | 918 |
| 3 | 122 | 144 | 8 | 64 | 16 | 354 |
| 1 | 390 | 60 | 12 | 52 | 4 | 518 |
| 2 | 38 | 0 | 0 | 8 | 0 | 46 |
******************************************************************
Passport
| Passport | Basic | Deluxe | King | Standard | Super Deluxe | All |
|---|---|---|---|---|---|---|
| All | 550 | 204 | 20 | 124 | 20 | 918 |
| 1 | 320 | 100 | 12 | 48 | 12 | 492 |
| 0 | 230 | 104 | 8 | 76 | 8 | 426 |
******************************************************************
OwnCar
| OwnCar | Basic | Deluxe | King | Standard | Super Deluxe | All |
|---|---|---|---|---|---|---|
| 1 | 314 | 124 | 18 | 82 | 20 | 558 |
| All | 550 | 204 | 20 | 124 | 20 | 918 |
| 0 | 236 | 80 | 2 | 42 | 0 | 360 |
******************************************************************
# explore how Age varies with gender and marital status for each package
sns.set(font_scale=1)
sns.catplot(data=df_cust, x="MaritalStatus", y="Age", col="ProductPitched", kind="bar")
plt.show()
sns.catplot(data=df_cust, x="Gender", y="Age", col="ProductPitched", kind="bar")
plt.show()
# feature extraction
df_cust["Age_bin"] = pd.cut(
    x=df_cust["Age"],
    bins=[18, 30, 40, 50, 61],
    labels=["18-30", "31-40", "41-50", ">50"],
    include_lowest=True,  # bins are right-closed; keep age 18 in the first bin
)
df_cust["Income_bin"] = pd.cut(
    x=df_cust["MonthlyIncome"],
    bins=[15000, 20000, 25000, 30000, 35000, 40000],
    labels=["15K - 20K", "20K - 25K", "25K - 30K", "30K - 35K", ">35K"],
    include_lowest=True,  # likewise for an income of exactly 15000
)
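Worth noting: `pd.cut` builds right-closed intervals, so the left edge of the first bin is excluded unless `include_lowest=True` is passed; a quick check on toy ages:

```python
import pandas as pd

# pd.cut bins are right-closed by default, e.g. (18, 30]; include_lowest=True
# keeps values equal to the first bin edge (age 18) from becoming NaN
ages = pd.Series([18, 30, 31, 50, 51])
binned = pd.cut(
    ages,
    bins=[18, 30, 40, 50, 61],
    labels=["18-30", "31-40", "41-50", ">50"],
    include_lowest=True,
)
print(binned.tolist())  # ['18-30', '18-30', '31-40', '41-50', '>50']
```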
# see how the product varies with age and income
sns.countplot(data=df_cust, x="Age_bin", hue="ProductPitched")
plt.show()
sns.countplot(data=df_cust, x="Income_bin", hue="ProductPitched")
plt.show()
Let's look at the columns that have missing values and see if we can find relationships to guide imputation.
# find number of missing values for each column
df2 = df1.copy()
df2.isna().sum()[df2.isna().sum() > 0]
Age                         225
TypeofContact                25
DurationOfPitch             251
NumberOfFollowups            45
PreferredPropertyStar        26
NumberOfTrips               140
NumberOfChildrenVisiting     66
MonthlyIncome               233
dtype: int64
# MonthlyIncome most likely varies with Occupation and Designation
plt.figure(figsize=(15, 10))
sns.set(font_scale=1)
sns.boxplot(data=df2, x="Occupation", y="MonthlyIncome", hue="Designation")
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
# impute missing values in MonthlyIncome based on Occupation and Designation
df2["MonthlyIncome"].fillna(
value=df2.groupby(["Occupation", "Designation"])["MonthlyIncome"].transform(
np.median
),
inplace=True,
)
# Age varies based on Gender and Designation
plt.figure(figsize=(15, 10))
sns.set(font_scale=1)
sns.boxplot(data=df2, x="Gender", y="Age", hue="Designation")
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
# impute missing values in Age based on Gender and Designation
df2["Age"].fillna(
value=df2.groupby(["Gender", "Designation"])["Age"].transform(np.median),
inplace=True,
)
# NumberOfChildrenVisiting was moderately correlated with NumberOfPersonVisiting
df2["NumberOfChildrenVisiting"].fillna(
value=df2.groupby(["NumberOfPersonVisiting"])["NumberOfChildrenVisiting"].transform(
np.median
),
inplace=True,
)
# DurationOfPitch most likely varies with what product is being pitched
plt.figure(figsize=(15, 8))
sns.boxplot(data=df2, y="DurationOfPitch", x="ProductPitched")
# impute missing values in DurationOfPitch
df2["DurationOfPitch"].fillna(
value=df2.groupby(["ProductPitched"])["DurationOfPitch"].transform(np.median),
inplace=True,
)
# relationships with NumberOfTrips across categorical variables
plt.figure(figsize=(25, 30))
sns.set(font_scale=2)
for i, name in enumerate(cat_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(data=df2, x=name, y="NumberOfTrips")
    plt.tight_layout()
    plt.title(name)
# check the designations of rows with missing NumberOfTrips (are any freelancers?)
df2[df2.NumberOfTrips.isna()].Designation.value_counts()
VP                82
AVP               50
Executive          5
Manager            2
Senior Manager     1
Name: Designation, dtype: int64
# impute NumberOfTrips with MaritalStatus group medians, since there are no freelancers among the missing rows
df2["NumberOfTrips"].fillna(
value=df2.groupby(["MaritalStatus"])["NumberOfTrips"].transform(np.median),
inplace=True,
)
# check how many missing values are left
df2.isna().sum()[df2.isna().sum() > 0]
TypeofContact            25
NumberOfFollowups        45
PreferredPropertyStar    26
dtype: int64
# impute the remaining missing values with the column median or mode
df2["TypeofContact"].fillna("Self Enquiry", inplace=True)  # "Self Enquiry" is the mode
df2["PreferredPropertyStar"].fillna(
value=df2["PreferredPropertyStar"].median(), inplace=True
)
df2["NumberOfFollowups"].fillna(value=df2["NumberOfFollowups"].median(), inplace=True)
# confirm that there are no more missing values
df2.isna().sum()[df2.isna().sum() > 0]
Series([], dtype: int64)
# check correlations after imputations to see if anything has changed
sns.set(font_scale=1)
plt.figure(figsize=(15, 7))
sns.heatmap(df2.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
# convert categorical variables to category datatype
df3 = df2.copy()
df3[cat_cols] = df3[cat_cols].astype("category")
We will drop the customer interaction data since we are trying to predict new, potential customers before pitching the package to them.
# create X and Y variables
X = df3.drop(
[
"ProdTaken",
"PitchSatisfactionScore",
"ProductPitched",
"NumberOfFollowups",
"DurationOfPitch",
],
axis=1,
)
y = df3["ProdTaken"]
# get dummy variables
X = pd.get_dummies(X, drop_first=True)
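`drop_first=True` keeps k-1 indicator columns per categorical, which avoids redundant (perfectly collinear) dummies; on a toy frame:

```python
import pandas as pd

# toy frame: drop_first=True drops the first level per column
# (alphabetically "Female" here), keeping only "Gender_Male"
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())  # ['Gender_Male']
```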
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=101, stratify=y
)
# verify that training and test sets have same distribution of 0s and 1s
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3417, 22)
Shape of test set :  (1465, 22)
Percentage of classes in training set:
0    0.811823
1    0.188177
Name: ProdTaken, dtype: float64
Percentage of classes in test set:
0    0.812287
1    0.187713
Name: ProdTaken, dtype: float64
Predicting that a customer will purchase the package when in reality they would not - a loss of resources.
Predicting that a customer will not purchase the package when in reality they would have purchased it - a loss of opportunity and revenue.
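Missing a would-be buyer (a false negative) is the costlier error, and recall = TP / (TP + FN) penalizes exactly that; a quick check with toy labels and sklearn:

```python
from sklearn.metrics import precision_score, recall_score

# toy labels: 10 actual buyers; the model misses 4 of them (false negatives)
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 6 + [0] * 4 + [0] * 8 + [1] * 2

print(recall_score(y_true, y_pred))     # 6 / (6 + 4) = 0.6
print(precision_score(y_true, y_pred))  # 6 / (6 + 2) = 0.75
```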
Hence, recall should be maximized: the greater the recall, the lower the chance of false negatives.
# defining a function to compute different metrics to check the performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )

    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Fitting a decision tree model with default parameters
d_tree = DecisionTreeClassifier(random_state=101)
d_tree.fit(X_train, y_train)
# Calculating different metrics
dtree_model_train_perf = model_performance_classification_sklearn(
d_tree, X_train, y_train
)
print("Training performance:\n", dtree_model_train_perf)
dtree_model_test_perf = model_performance_classification_sklearn(d_tree, X_test, y_test)
print("Testing performance:\n", dtree_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(d_tree, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.874403 0.654545 0.669145 0.661765
# Choose the type of classifier.
dtree_tuned = DecisionTreeClassifier(class_weight={0: 0.19, 1: 0.81}, random_state=101)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 20),
"min_samples_leaf": [2, 5, 7, 10, 15],
"max_leaf_nodes": [2, 3, 5, 10, 15],
"min_impurity_decrease": [0.0001, 0.001, 0.01, 0.1],
"criterion": ["entropy", "gini"],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer, n_jobs=-1, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_tuned.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.19, 1: 0.81}, max_depth=7,
max_leaf_nodes=15, min_impurity_decrease=0.0001,
min_samples_leaf=2, random_state=101)
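The `class_weight={0: 0.19, 1: 0.81}` used above mirrors the ~81/19 class split in ProdTaken; a sketch of deriving it with swapped class proportions (labels below are illustrative):

```python
import pandas as pd

# illustrative labels reproducing the ~81/19 ProdTaken split in the training set
y = pd.Series([0] * 81 + [1] * 19)
props = y.value_counts(normalize=True)

# weight each class by the other class's share, so errors on the
# minority class (buyers) cost roughly 4x more during training
class_weight = {0: float(props[1]), 1: float(props[0])}
print(class_weight)  # {0: 0.19, 1: 0.81}
```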
# Calculating different metrics
dtree_tuned_model_train_perf = model_performance_classification_sklearn(
dtree_tuned, X_train, y_train
)
print("Training performance:\n", dtree_tuned_model_train_perf)
dtree_tuned_model_test_perf = model_performance_classification_sklearn(
dtree_tuned, X_test, y_test
)
print("Testing performance:\n", dtree_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(dtree_tuned, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 0.752414 0.720062 0.410097 0.522573
Testing performance:
Accuracy Recall Precision F1
0 0.74471 0.647273 0.391209 0.487671
# Fitting the model
rf_estimator = RandomForestClassifier(random_state=101)
rf_estimator.fit(X_train, y_train)
# Calculating different metrics
rf_estimator_model_train_perf = model_performance_classification_sklearn(
rf_estimator, X_train, y_train
)
print("Training performance:\n", rf_estimator_model_train_perf)
rf_estimator_model_test_perf = model_performance_classification_sklearn(
rf_estimator, X_test, y_test
)
print("Testing performance:\n", rf_estimator_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(rf_estimator, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.881911 0.443636 0.859155 0.585132
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(class_weight={0: 0.19, 1: 0.81}, random_state=101)
parameters = {
"max_depth": list(np.arange(5, 20, 5)) + [None],
"max_features": ["sqrt", "log2", None],
"max_samples": [0.3, 0.7, 1.0],
"n_estimators": np.arange(10, 200, 50),
"min_samples_leaf": np.arange(2, 10),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.19, 1: 0.81}, max_depth=5,
max_features='sqrt', max_samples=1.0, min_samples_leaf=6,
n_estimators=10, random_state=101)
# Calculating different metrics
rf_tuned_model_train_perf = model_performance_classification_sklearn(
rf_tuned, X_train, y_train
)
print("Training performance:\n", rf_tuned_model_train_perf)
rf_tuned_model_test_perf = model_performance_classification_sklearn(
rf_tuned, X_test, y_test
)
print("Testing performance:\n", rf_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(rf_tuned, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 0.788704 0.690513 0.459152 0.551553
Testing performance:
Accuracy Recall Precision F1
0 0.787713 0.603636 0.451087 0.51633
# Fitting the model
bagging_classifier = BaggingClassifier(random_state=101)
bagging_classifier.fit(X_train, y_train)
# Calculating different metrics
bagging_classifier_model_train_perf = model_performance_classification_sklearn(
bagging_classifier, X_train, y_train
)
print("Training performance:\n", bagging_classifier_model_train_perf)
bagging_classifier_model_test_perf = model_performance_classification_sklearn(
bagging_classifier, X_test, y_test
)
print("Testing performance:\n", bagging_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(bagging_classifier, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 0.992391 0.96112 0.998384 0.979398
Testing performance:
Accuracy Recall Precision F1
0 0.900341 0.578182 0.84127 0.685345
# Choose the type of classifier.
bagging_tuned = BaggingClassifier(
DecisionTreeClassifier(class_weight={0: 0.19, 1: 0.81}, random_state=101),
random_state=101,
)
# Grid of parameters to choose from
parameters = {
"max_samples": [0.7, 0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
"n_estimators": [10, 20, 30, 40, 50],
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(bagging_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
bagging_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_tuned.fit(X_train, y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.19,
1: 0.81},
random_state=101),
max_features=0.9, max_samples=0.9, n_estimators=30,
random_state=101)
# Calculating different metrics
bagging_tuned_model_train_perf = model_performance_classification_sklearn(
bagging_tuned, X_train, y_train
)
print("Training performance:\n", bagging_tuned_model_train_perf)
bagging_tuned_model_test_perf = model_performance_classification_sklearn(
bagging_tuned, X_test, y_test
)
print("Testing performance:\n", bagging_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(bagging_tuned, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 0.996488 0.982893 0.99842 0.990596
Testing performance:
Accuracy Recall Precision F1
0 0.887372 0.476364 0.861842 0.613583
# Fitting the model
ab_classifier = AdaBoostClassifier(random_state=101)
ab_classifier.fit(X_train, y_train)
# Calculating different metrics
ab_classifier_model_train_perf = model_performance_classification_sklearn(
ab_classifier, X_train, y_train
)
print("Training performance:\n", ab_classifier_model_train_perf)
ab_classifier_model_test_perf = model_performance_classification_sklearn(
    ab_classifier, X_test, y_test
)
print("Testing performance:\n", ab_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(ab_classifier, X_test, y_test)
Training performance:
   Accuracy    Recall  Precision        F1
0  0.846064  0.293935   0.724138  0.418142
Testing performance:
   Accuracy    Recall  Precision        F1
0  0.845051  0.272727   0.735294  0.397878
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=101)
# Grid of parameters to choose from
parameters = {
# Let's try different max_depth for base_estimator
"base_estimator": [
DecisionTreeClassifier(max_depth=1),
DecisionTreeClassifier(max_depth=2),
DecisionTreeClassifier(max_depth=3),
],
"n_estimators": np.arange(10, 110, 20),
"learning_rate": np.arange(0.1, 2, 0.1),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=0.8, n_estimators=90, random_state=101)
# Calculating different metrics
abc_tuned_model_train_perf = model_performance_classification_sklearn(
abc_tuned, X_train, y_train
)
print("Training performance:\n", abc_tuned_model_train_perf)
abc_tuned_model_test_perf = model_performance_classification_sklearn(
    abc_tuned, X_test, y_test
)
print("Testing performance:\n", abc_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(abc_tuned, X_test, y_test)
Training performance:
   Accuracy    Recall  Precision        F1
0  0.965759  0.869362   0.944257  0.905263
Testing performance:
   Accuracy  Recall  Precision        F1
0  0.886007    0.56       0.77  0.648421
# Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=101)
gb_classifier.fit(X_train, y_train)
# Calculating different metrics
gb_classifier_model_train_perf = model_performance_classification_sklearn(
gb_classifier, X_train, y_train
)
print("Training performance:\n", gb_classifier_model_train_perf)
gb_classifier_model_test_perf = model_performance_classification_sklearn(
gb_classifier, X_test, y_test
)
print("Testing performance:\n", gb_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(gb_classifier, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 0.875622 0.416796 0.842767 0.557752
Testing performance:
Accuracy Recall Precision F1
0 0.853925 0.316364 0.769912 0.448454
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=101), random_state=101
)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100, 150, 200, 250],
"subsample": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=101),
max_features=0.7, n_estimators=250, random_state=101,
subsample=1)
# Calculating different metrics
gbc_tuned_model_train_perf = model_performance_classification_sklearn(
gbc_tuned, X_train, y_train
)
print("Training performance:\n", gbc_tuned_model_train_perf)
gbc_tuned_model_test_perf = model_performance_classification_sklearn(
gbc_tuned, X_test, y_test
)
print("Testing performance:\n", gbc_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(gbc_tuned, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 0.906351 0.556765 0.910941 0.69112
Testing performance:
Accuracy Recall Precision F1
0 0.864164 0.4 0.763889 0.52506
# Fitting the model
xgb_classifier = XGBClassifier(random_state=101, eval_metric="logloss")
xgb_classifier.fit(X_train, y_train)
# Calculating different metrics
xgb_classifier_model_train_perf = model_performance_classification_sklearn(
xgb_classifier, X_train, y_train
)
print("Training performance:\n", xgb_classifier_model_train_perf)
xgb_classifier_model_test_perf = model_performance_classification_sklearn(
xgb_classifier, X_test, y_test
)
print("Testing performance:\n", xgb_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(xgb_classifier, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 0.997659 0.987558 1.0 0.99374
Testing performance:
Accuracy Recall Precision F1
0 0.894881 0.56 0.823529 0.666667
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=101, eval_metric="logloss")
# Grid of parameters to choose from
parameters = {
"n_estimators": [10, 30, 50],
"scale_pos_weight": [1, 2, 5],
"subsample": [0.7, 0.9, 1],
"learning_rate": [0.05, 0.1, 0.2],
"colsample_bytree": [0.7, 0.9, 1],
"colsample_bylevel": [0.5, 0.7, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=0.5, colsample_bynode=None,
colsample_bytree=0.7, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.05, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=30, n_jobs=None,
num_parallel_tree=None, predictor=None, random_state=101, ...)
# Calculating different metrics
xgb_tuned_model_train_perf = model_performance_classification_sklearn(
xgb_tuned, X_train, y_train
)
print("Training performance:\n", xgb_tuned_model_train_perf)
xgb_tuned_model_test_perf = model_performance_classification_sklearn(
xgb_tuned, X_test, y_test
)
print("Testing performance:\n", xgb_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(xgb_tuned, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 0.840211 0.835148 0.549642 0.662963
Testing performance:
Accuracy Recall Precision F1
0 0.817747 0.72 0.510309 0.597285
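`scale_pos_weight` is XGBoost's lever for class imbalance and is commonly set near the negative-to-positive ratio, which is why values up to 5 were searched; a quick sketch (the class counts below are illustrative, matching the ~81/19 training split):

```python
# scale_pos_weight is commonly set near n_negative / n_positive;
# counts below are illustrative, matching the ~81/19 training split
n_neg, n_pos = 2774, 643
scale_pos_weight = n_neg / n_pos
print(round(scale_pos_weight, 2))  # 4.31
```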
# combine the models that generalized well and achieved the better recall
estimators = [
("Decision Tree", dtree_tuned),
("Random Forest", rf_tuned),
("Gradient Boosting", gbc_tuned),
]
final_estimator = xgb_tuned
stacking_classifier = StackingClassifier(
estimators=estimators, final_estimator=final_estimator
)
stacking_classifier.fit(X_train, y_train)
StackingClassifier(estimators=[('Decision Tree',
DecisionTreeClassifier(class_weight={0: 0.19,
1: 0.81},
max_depth=7,
max_leaf_nodes=15,
min_impurity_decrease=0.0001,
min_samples_leaf=2,
random_state=101)),
('Random Forest',
RandomForestClassifier(class_weight={0: 0.19,
1: 0.81},
max_depth=5,
max_features='sqrt',
max_samples=1.0,
min_samples_leaf=6,
n_estimators=10,
ran...
gpu_id=None, grow_policy=None,
importance_type=None,
interaction_constraints=None,
learning_rate=0.05,
max_bin=None,
max_cat_threshold=None,
max_cat_to_onehot=None,
max_delta_step=None,
max_depth=None,
max_leaves=None,
min_child_weight=None,
missing=nan,
monotone_constraints=None,
n_estimators=30, n_jobs=None,
num_parallel_tree=None,
predictor=None,
random_state=101, ...))
# Calculating different metrics
stacking_classifier_model_train_perf = model_performance_classification_sklearn(
stacking_classifier, X_train, y_train
)
print("Training performance:\n", stacking_classifier_model_train_perf)
stacking_classifier_model_test_perf = model_performance_classification_sklearn(
stacking_classifier, X_test, y_test
)
print("Testing performance:\n", stacking_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(stacking_classifier, X_test, y_test)
Training performance:
Accuracy Recall Precision F1
0 0.842552 0.808709 0.55615 0.659062
Testing performance:
Accuracy Recall Precision F1
0 0.831399 0.734545 0.537234 0.620584
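For context, `StackingClassifier` fits the base estimators and then trains the final estimator on their out-of-fold predictions rather than on the raw features; a minimal sketch on synthetic data (all estimator choices here are illustrative, not the tuned models above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# synthetic binary-classification data; estimator choices are illustrative
X_toy, y_toy = make_classification(n_samples=200, random_state=101)
stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=101)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=101)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # the final estimator trains on out-of-fold predictions
)
stack.fit(X_toy, y_toy)
print(stack.score(X_toy, y_toy))
```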
# training performance comparison
models_train_comp_df = pd.concat(
[
dtree_model_train_perf.T,
dtree_tuned_model_train_perf.T,
rf_estimator_model_train_perf.T,
rf_tuned_model_train_perf.T,
bagging_classifier_model_train_perf.T,
bagging_tuned_model_train_perf.T,
ab_classifier_model_train_perf.T,
abc_tuned_model_train_perf.T,
gb_classifier_model_train_perf.T,
gbc_tuned_model_train_perf.T,
xgb_classifier_model_train_perf.T,
xgb_tuned_model_train_perf.T,
stacking_classifier_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree",
"Decision Tree Tuned",
"Random Forest",
"Random Forest Tuned",
"Bagging Classifier",
"Bagging Classifier Tuned",
"Adaboost Classifier",
"Adaboost Classifier Tuned",
"Gradient Boost Classifier",
"Gradient Boost Classifier Tuned",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Decision Tree | Decision Tree Tuned | Random Forest | Random Forest Tuned | Bagging Classifier | Bagging Classifier Tuned | Adaboost Classifier | Adaboost Classifier Tuned | Gradient Boost Classifier | Gradient Boost Classifier Tuned | XGBoost Classifier | XGBoost Classifier Tuned | Stacking Classifier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 1.0 | 0.752414 | 1.0 | 0.788704 | 0.992391 | 0.996488 | 0.846064 | 0.965759 | 0.875622 | 0.906351 | 0.997659 | 0.840211 | 0.842552 |
| Recall | 1.0 | 0.720062 | 1.0 | 0.690513 | 0.961120 | 0.982893 | 0.293935 | 0.869362 | 0.416796 | 0.556765 | 0.987558 | 0.835148 | 0.808709 |
| Precision | 1.0 | 0.410097 | 1.0 | 0.459152 | 0.998384 | 0.998420 | 0.724138 | 0.944257 | 0.842767 | 0.910941 | 1.000000 | 0.549642 | 0.556150 |
| F1 | 1.0 | 0.522573 | 1.0 | 0.551553 | 0.979398 | 0.990596 | 0.418142 | 0.905263 | 0.557752 | 0.691120 | 0.993740 | 0.662963 | 0.659062 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
dtree_model_test_perf.T,
dtree_tuned_model_test_perf.T,
rf_estimator_model_test_perf.T,
rf_tuned_model_test_perf.T,
bagging_classifier_model_test_perf.T,
bagging_tuned_model_test_perf.T,
ab_classifier_model_test_perf.T,
abc_tuned_model_test_perf.T,
gb_classifier_model_test_perf.T,
gbc_tuned_model_test_perf.T,
xgb_classifier_model_test_perf.T,
xgb_tuned_model_test_perf.T,
stacking_classifier_model_test_perf.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree",
"Decision Tree Tuned",
"Random Forest",
"Random Forest Tuned",
"Bagging Classifier",
"Bagging Classifier Tuned",
"Adaboost Classifier",
"Adaboost Classifier Tuned",
"Gradient Boost Classifier",
"Gradient Boost Classifier Tuned",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Decision Tree | Decision Tree Tuned | Random Forest | Random Forest Tuned | Bagging Classifier | Bagging Classifier Tuned | Adaboost Classifier | Adaboost Classifier Tuned | Gradient Boost Classifier | Gradient Boost Classifier Tuned | XGBoost Classifier | XGBoost Classifier Tuned | Stacking Classifier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.874403 | 0.744710 | 0.881911 | 0.787713 | 0.900341 | 0.887372 | 0.845051 | 0.886007 | 0.853925 | 0.864164 | 0.894881 | 0.817747 | 0.831399 |
| Recall | 0.654545 | 0.647273 | 0.443636 | 0.603636 | 0.578182 | 0.476364 | 0.272727 | 0.560000 | 0.316364 | 0.400000 | 0.560000 | 0.720000 | 0.734545 |
| Precision | 0.669145 | 0.391209 | 0.859155 | 0.451087 | 0.841270 | 0.861842 | 0.735294 | 0.770000 | 0.769912 | 0.763889 | 0.823529 | 0.510309 | 0.537234 |
| F1 | 0.661765 | 0.487671 | 0.585132 | 0.516330 | 0.685345 | 0.613583 | 0.397878 | 0.648421 | 0.448454 | 0.525060 | 0.666667 | 0.597285 | 0.620584 |
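Since recall is the evaluation metric, the test comparison table can be ranked directly on that row. A minimal sketch, using the top recall values copied from the table above (not recomputed):

```python
# Rank a handful of the models above by test-set recall.
import pandas as pd

test_recall = pd.Series(
    {
        "Stacking Classifier": 0.734545,
        "XGBoost Classifier Tuned": 0.720000,
        "Decision Tree": 0.654545,
        "Decision Tree Tuned": 0.647273,
        "Random Forest Tuned": 0.603636,
    },
    name="Recall",
)
# Highest-recall models first
print(test_recall.sort_values(ascending=False).head(3))
```

The stacking classifier and the tuned XGBoost model lead on recall, which is why the final choice is made among them.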
# feature importances
feature_names = X_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Data Background:
Data Preprocessing:
Observations from EDA:
Customer Profiles:
Model Building and Performance:
Models were built to predict whether or not a customer will purchase a package. Recall was chosen as the evaluation metric in order to minimize false negatives, i.e. potential buyers who would otherwise never be contacted. A decision tree, random forest, bagging classifier, AdaBoost classifier, gradient boosting classifier, and XGBoost classifier were each built and then tuned with hyperparameter search, and a stacking model was built that combined the best individual models. The chosen tuned XGBoost model achieved a recall of 0.72 on the test set. Based on this model, having a passport, being an Executive, being single, and living in a Tier 3 city are the most significant variables for determining whether a customer will purchase a package.
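The rationale for choosing recall can be made concrete: recall = TP / (TP + FN), so every missed potential buyer (false negative) lowers it directly, while false positives (wasted contacts) do not. A small illustrative sketch with made-up labels:

```python
# Why recall minimizes false negatives: recall = TP / (TP + FN).
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # one false negative, one false positive

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))  # same value via sklearn
```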